Paper Reading
AI Chip/Accelerator
- Nature
- Neuro-inspired computing chips(2020)
- Illusion of large on-chip memory by networked computing chips for neural network inference(2021)
- Neuromorphic computing at scale(2025)
- Science
- Edge learning using a fully integrated neuro-inspired memristor chip(2023)
- ISSCC
- A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems(2016)
- A 288μW Programmable Deep-Learning Processor with 270KB On-Chip Weight Storage Using Non-Uniform Memory Hierarchy for Mobile Intelligence(2017)
- A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face-Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector(2017)
- A 2.9TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems(2017)
- DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks(2017)
- UNPU: A 50.6TOPS/W Unified Deep Neural Network Accelerator with 1b-to-16b Fully-Variable Weight Bit-Precision(2018)
- A 65nm 4Kb Algorithm-Dependent Computing-in-Memory SRAM Unit-Macro with 2.3ns and 55.8TOPS/W Fully Parallel Product-Sum Operation for Binary DNN Edge Processors(2018)
- 7.6 A 65nm 236.5nJ/Classification Neuromorphic Processor with 7.5% Energy Overhead On-Chip Learning Using Direct Spike-Only Feedback(2019)
- A 28nm 64Kb 6T SRAM Computing-in-Memory Macro with 8b MAC Operation for AI Edge Chips(2020)
- AMD Chiplet Architecture for High-Performance Server and Desktop Products(2020)
- 15.3 A 351TOPS/W and 372.4GOPS Compute-in-Memory SRAM Macro in 7nm FinFET CMOS for Machine-Learning Applications(2020)
- A Programmable Neural-Network Inference Accelerator Based on Scalable In-Memory Computing(2021)
- An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In-Memory Macro in 22nm for Machine-Learning Edge Applications(2021)
- A 5-nm 254-TOPS/W 221-TOPS/mm2 Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations(2022)
- DIMC: 2219TOPS/W 2569F2/b Digital In-Memory Computing Macro in 28nm Based on Approximate Arithmetic Hardware(2022)
- A 4nm 6163-TOPS/W/b 4790-TOPS/mm2/b SRAM-Based Digital-Computing-in-Memory Macro Supporting Bit-Width Flexibility and Simultaneous MAC and Weight Update(2023)
- 34.4 A 3nm, 32.5TOPS/W, 55.0TOPS/mm2 and 3.78Mb/mm2 Fully-Digital Compute-in-Memory Macro Supporting INT12 × INT12 with a Parallel-MAC Architecture and Foundry 6T-SRAM Bit Cell(2024)
- JSSC
- Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks(2017)
- A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications(2018)
- Evolver: A Deep Learning Processor With On-Device Quantization–Voltage–Frequency Tuning(2021)
- TranCIM: Full-Digital Bitline-Transpose CIM-based Sparse Transformer Accelerator With Pipeline/Parallel Reconfigurable Modes(2023)
- ReDCIM: Reconfigurable Digital Computing-In-Memory Processor With Unified FP/INT Pipeline for Cloud AI Acceleration(2023)
- IEDM
- NeuroSim+: An integrated device-to-algorithm framework for benchmarking synaptic devices and array architectures(2017)
- DNN+NeuroSim: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators with Versatile Device Technologies(2019)
- AI Computing in Light of 2.5D Interconnect Roadmap: Big-Little Chiplets for In-memory Acceleration(2022)
- Design of Analog-AI Hardware Accelerators for Transformer-based Language Models(2023)
- VLSI
- A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications(2017)
- A 40nm Analog-Input ADC-Free Compute-in-Memory RRAM Macro with Pulse-Width Modulation between Sub-arrays(2022)
- A 12nm 121-TOPS/W 41.6-TOPS/mm2 All Digital Full Precision SRAM-based Compute-in-Memory with Configurable Bit-width For AI Edge Applications(2022)
- A 12nm 137 TOPS/W Digital Compute-In-Memory using Foundry 8T SRAM Bitcell supporting 16 Kernel Weight Sets for AI Edge Applications(2023)
- ISCA
- ShiDianNao: shifting vision processing closer to the sensor(2015)
- ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars(2016)
- PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory(2016)
- EIE: efficient inference engine on compressed deep neural network(2016)
- In-Datacenter Performance Analysis of a Tensor Processing Unit(2017)
- SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks(2017)
- SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks(2017)
- SpinalFlow: An Architecture and Dataflow Tailored for Spiking Neural Networks(2020)
- ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks(2021)
- A software-defined tensor streaming multiprocessor for large-scale machine learning(2022)
- MICRO
- Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture(2019)
- Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture(2021)
- Si-Kintsugi: Towards Recovering Golden-Like Performance of Defective Many-Core Spatial Architectures for AI(2023)
- SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling(2024)
- ASPLOS
- DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning(2014)
- PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference(2019)
- DOTA: detect and omit weak attentions for scalable transformer acceleration(2022)
- TinyForge: A Design Space Exploration to Advance Energy and Silicon Area Trade-offs in tinyML Compute Architectures with Custom Latch Arrays(2024)
- DAC
- Atomlayer: a universal ReRAM-based CNN accelerator with atomic layer computation(2018)
- A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM(2019)
- A Two-way SRAM Array based Accelerator for Deep Neural Network On-chip Training(2020)
- HERO: hessian-enhanced robust optimization for unifying and improving generalization and quantization performance(2022)
- Chiplet actuary: a quantitative cost model and multi-chiplet architecture exploration(2022)
- PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators(2023)
- AutoDCIM: An Automated Digital CIM Compiler(2023)
- A Convolution Neural Network Accelerator Design with Weight Mapping and Pipeline Optimization(2023)
- PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators(2023)
- Chiplets: How Small is too Small?(2023)
- HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement(2023)
- HPCA
- PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning(2017)
- A3: Accelerating Attention Mechanisms in Neural Networks with Approximation(2020)
- SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning(2021)
- Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing(2021)
- MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores(2022)
- TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer(2022)
- Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators(2024)
- Lightening-Transformer: A Dynamically-operated Optically-interconnected Photonic Transformer Accelerator(2024)
- Prosperity: Accelerating Spiking Neural Networks via Product Sparsity(2025)
- ISLPED
- A Fully-Integrated Energy-Scalable Transformer Accelerator Supporting Adaptive Model Configuration and Word Elimination for Language Understanding on Edge Devices(2023)
- ICCAD
- Scaling the “memory wall”(2012)
- OpenRAM: An open-source memory compiler(2016)
- Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs(2019)
- MAGNet: A Modular Accelerator Generator for Neural Networks(2019)
- ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration(2020)
- GAMMA: automating the HW mapping of DNN models on accelerators via genetic algorithm(2020)
- Multi-Objective Optimization of ReRAM Crossbars for Robust DNN Inferencing under Stochastic Noise(2021)
- Design Space and Memory Technology Co-Exploration for In-Memory Computing Based Machine Learning Accelerators(2022)
- Big-Little Chiplets for In-Memory Acceleration of DNNs: A Scalable Heterogeneous Architecture(2022)
- GLSVLSI
- Computing Utilization Enhancement for Chiplet-based Homogeneous Processing-in-Memory Deep Learning Processors(2021)
- DATE
- CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory(2012)
- ReCom: An efficient resistive accelerator for compressed deep neural networks(2018)
- TDO-CIM: Transparent Detection and Offloading for Computation In-memory(2020)
- A Fast and Energy Efficient Computing-in-Memory Architecture for Few-Shot Learning Applications(2020)
- Shenjing: A low power reconfigurable neuromorphic accelerator with partial-sum and spike networks-on-chip(2020)
- A Runtime Reconfigurable Design of Compute-in-Memory based Hardware Accelerator(2021)
- In-Memory Computing based Accelerator for Transformer Networks for Long Sequences(2021)
- Gibbon: Efficient Co-Exploration of NN Model and Processing-In-Memory Architecture(2022)
- Achieving Datacenter-scale Performance through Chiplet-based Manycore Architectures(2023)
- SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors(2024)
- ASP-DAC
- ReGAN: A pipelined ReRAM-based accelerator for generative adversarial networks(2018)
- Learning the sparsity for ReRAM: mapping and pruning sparse neural network for ReRAM based accelerator(2019)
- This is SPATEM! A Spatial-Temporal Optimization Framework for Efficient Inference on ReRAM-based CNN Accelerator(2022)
- Improving the Robustness and Efficiency of PIM-Based Architecture by SW/HW Co-Design(2023)
- A Low-Bitwidth Integer-STBP Algorithm for Efficient Training and Inference of Spiking Neural Networks(2023)
- MINT: Multiplier-less INTeger Quantization for Energy Efficient Spiking Neural Networks(2024)
- ISCAS
- Optimizing Weight Mapping and Data Flow for Convolutional Neural Networks on RRAM Based Processing-In-Memory Architecture(2019)
- MINT: Mixed-Precision RRAM-Based IN-Memory Training Architecture(2020)
- An 8T SRAM Based Digital Compute-In-Memory Macro For Multiply-And-Accumulate Accelerating(2023)
- MWSCAS
- 8T XNOR-SRAM based Parallel Compute-in-Memory for Deep Neural Network Accelerator(2020)
- Open-Source Memory Compiler for Automatic RRAM Generation and Verification(2021)
- TC
- CIMAT: A Compute-In-Memory Architecture for On-chip Training Based on Transpose SRAM Arrays(2020)
- Device-Circuit-Architecture Co-Exploration for Computing-in-Memory Neural Accelerators(2021)
- TCAS-I
- Research Progress on Memristor: From Synapses to Computing Systems(2022)
- ENNA: An Efficient Neural Network Accelerator Design Based on ADC-Free Compute-In-Memory Subarrays(2023)
- TCAD
- MNSIM: Simulation Platform for Memristor-Based Neuromorphic Computing System(2018)
- DNN+NeuroSim V2.0: An End-to-End Benchmarking Framework for Compute-in-Memory Accelerators for On-Chip Training(2021)
- OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory(2022)
- SWAP: A Server-Scale Communication-Aware Chiplet-Based Manycore PIM Accelerator(2022)
- H2Learn: High-Efficiency Learning Accelerator for High-Accuracy Spiking Neural Networks(2022)
- ESSENCE: Exploiting Structured Stochastic Gradient Pruning for Endurance-Aware ReRAM-Based In-Memory Training Systems(2023)
- A Coordinated Model Pruning and Mapping Framework for RRAM-Based DNN Accelerators(2023)
- AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers(2023)
- SATA: Sparsity-Aware Training Accelerator for Spiking Neural Networks(2023)
- SpikeSim: An End-to-End Compute-in-Memory Hardware Evaluation Tool for Benchmarking Spiking Neural Networks(2023)
- TVLSI
- Deep Convolutional Neural Network Architecture With Reconfigurable Computation Patterns(2017)
- Benchmark of the Compute-in-Memory-Based DNN Accelerator With Area Constraint(2020)
- An Algorithm–Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers(2022)
- A 40-nm 1.89-pJ/SOP Scalable Convolutional Spiking Neural Network Learning Core With On-Chip Spatiotemporal Back-Propagation(2023)
- Others
- Steps toward Artificial Intelligence(1961)
- Analyzing CUDA workloads using a detailed GPU simulator(2009)
- DRAMSim2: A Cycle Accurate Memory System Simulator(2011)
- Unsupervised learning of digit recognition using spike-timing-dependent plasticity(2015)
- Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks(2016)
- Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1(2016)
- Ramulator: A Fast and Extensible DRAM Simulator(2016)
- Efficient Processing of Deep Neural Networks: A Tutorial and Survey(2017)
- FINN: A Framework for Fast, Scalable Binarized Neural Network Inference(2017)
- HBM (High Bandwidth Memory) DRAM Technology and Architecture(2017)
- CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories(2017)
- A Lightweight YOLOv2: A Binarized CNN with A Parallel Support Vector Regression for an FPGA(2018)
- NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning(2018)
- Motivation for and Evaluation of the First Tensor Processing Unit(2018)
- NVIDIA Tensor Core Programmability, Performance & Precision(2018)
- Loihi: A Neuromorphic Manycore Processor with On-Chip Learning(2018)
- Training Deep Spiking Convolutional Neural Networks With STDP-Based Unsupervised Pre-training Followed by Supervised Fine-Tuning(2018)
- mRNA: Enabling Efficient Mapping Space Exploration for a Reconfiguration Neural Accelerator(2019)
- In-Memory Computing: Advances and Prospects(2019)
- Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices(2019)
- Device and materials requirements for neuromorphic computing(2019)
- Timeloop: A Systematic Approach to DNN Accelerator Evaluation(2019)
- Exploring Bit-Slice Sparsity in Deep Neural Networks for Efficient ReRAM-Based Deployment(2019)
- Modeling Deep Learning Accelerator Enabled GPUs(2019)
- MLP+NeuroSimV3.0: Improving On-chip Learning Performance with Device to Algorithm Optimizations(2019)
- Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks(2019)
- SpykeTorch: Efficient Simulation of Convolutional Spiking Neural Networks With at Most One Spike per Neuron(2019)
- Bio-inspired digit recognition using reward-modulated spike-timing-dependent plasticity in deep convolutional networks(2019)
- Efficient spiking neural network training and inference with reduced precision memory and computing(2019)
- MNSIM 2.0: A Behavior-Level Modeling Tool for Memristor-based Neuromorphic Computing Systems(2020)
- An Architecture-Level Energy and Area Estimator for Processing-In-Memory Accelerator Designs(2020)
- Compressing Large-Scale Transformer-Based Models: A Case Study on BERT(2020)
- Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer(2020)
- HAT: Hardware-Aware Transformers for Efficient Natural Language Processing(2020)
- DRAMsim3: A Cycle-Accurate, Thermal-Capable DRAM Simulator(2020)
- OPTIMUS: OPTImized matrix MUltiplication Structure for Transformer neural network accelerator(2020)
- Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis(2020)
- A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim(2020)
- Compute-in-RRAM with Limited On-chip Resources(2021)
- Compute-in-Memory Chips for Deep Learning: Recent Trends and Prospects(2021)
- NeuroSim Simulator for Compute-in-Memory Hardware Accelerator: Validation and Benchmark(2021)
- Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-In-Memory Hardware(2021)
- SIAM: Chiplet-based Scalable In-Memory Acceleration with Mesh for Deep Neural Networks(2021)
- Wafer Level System Integration of the Fifth Generation CoWoS®-S with High Performance Si Interposer at 2500 mm2(2021)
- VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference(2021)
- Revisiting Batch Normalization for Training Low-Latency Deep Spiking Neural Networks From Scratch(2021)
- Q-SpiNN: A Framework for Quantizing Spiking Neural Networks(2021)
- SSTDP: Supervised Spike Timing Dependent Plasticity for Efficient Spiking Neural Network Training(2021)
- Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication(2022)
- Digital Versus Analog Artificial Intelligence Accelerators: Advances, trends, and emerging designs(2022)
- ULECGNet: An Ultra-Lightweight End-to-End ECG Classification Neural Network(2022)
- Design Methodology and Trends of SRAM-Based Compute-in-Memory Circuits(2022)
- Tolerating Noise Effects in Processing-in-Memory Systems for Neural Networks: A Hardware–Software Codesign Perspective(2022)
- On Building Efficient and Robust Neural Network Designs(2022)
- ReaLPrune: ReRAM Crossbar-aware Lottery Ticket Pruned CNNs(2022)
- From Macro To Microarchitecture: Reviews and Trends of SRAM-Based Compute-in-Memory Circuits(2023)
- Side-Channel Attack Analysis on In-Memory Computing Architectures(2023)
- An Ultra-Low Power TinyML System for Real-Time Visual Processing at Edge(2023)
- Hardware-aware Quantization/Mapping Strategies for Compute-in-Memory Accelerators(2023)
- Wafer-scale Computing: Advancements, Challenges, and Future Perspectives(2023)
- Neuro-Symbolic Computing: Advancements and Challenges in Hardware-Software Co-Design(2023)
- Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators(2023)
- A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models(2023)
- Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator(2023)
- End-to-End Benchmarking of Chiplet-Based In-Memory Computing(2023)
- Performance Impact of Architectural Parameters on Chiplet-Based IMC Accelerators(2023)
- The Big Chip: Challenge, Model and Architecture(2023)
- The Rise and Potential of Large Language Model Based Agents: A Survey(2023)
- ReFloat: Low-Cost Floating-Point Processing in ReRAM for Accelerating Iterative Linear Solvers(2023)
- Knowledge Distillation between DNN and SNN for Intelligent Sensing Systems on Loihi Chip(2023)
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits(2024)
- FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs(2024)
- Weight Update Scheme for 1T1R Memristor Array Based Equilibrium Propagation(2024)
- AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology(2024)
- ChipNeMo: Domain-Adapted LLMs for Chip Design(2024)
- Unleashing Energy-Efficiency: Neural Architecture Search without Training for Spiking Neural Networks on Loihi Chip(2024)
- Quantization-Aware Training of Spiking Neural Networks for Energy-Efficient Spectrum Sensing on Loihi Chip(2024)
- Legendre-SNN on Loihi-2: Evaluation and Insights(2024)
- Are SNNs Truly Energy-efficient? — A Hardware Perspective(2024)
- Approximate Adder Tree Design with Sparsity-Aware Encoding and In-Memory Swapping for SRAM-based Digital Compute-In-Memory Macros(2024)
- Workload-Balanced Pruning for Sparse Spiking Neural Networks(2024)
- An all integer-based spiking neural network with dynamic threshold adaptation(2024)
Machine Learning
- Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures(2009)
- ImageNet Classification with Deep Convolutional Neural Networks(2012)
- Training deep neural networks with low precision multiplications(2015)
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding(2016)
- Attention Is All You Need(2017)
- Training and Inference with Integers in Deep Neural Networks(2018)
- Improving Language Understanding by Generative Pre-Training(2018)
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference(2018)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding(2019)
- Language Models are Unsupervised Multitask Learners(2019)
- Q8BERT: Quantized 8Bit BERT(2019)
- Machine Learning at Facebook: Understanding Inference at the Edge(2019)
- Generating Long Sequences with Sparse Transformers(2019)
- Fast Transformer Decoding: One Write-Head is All You Need(2019)
- HAQ: Hardware-Aware Automated Quantization With Mixed Precision(2019)
- Language Models are Few-Shot Learners(2020)
- Training high-performance and large-scale deep neural networks with full 8-bit integers(2020)
- Longformer: The Long-Document Transformer(2020)
- ETC: Encoding Long and Structured Inputs in Transformers(2020)
- Big Bird: Transformers for Longer Sequences(2020)
- Long Range Arena: A Benchmark for Efficient Transformers(2020)
- Low Latency Deep Learning Inference Model for Distributed Intelligent IoT Edge Clusters(2021)
- Memory-efficient Transformers via Top-k Attention(2021)
- I-BERT: Integer-only BERT Quantization(2021)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness(2022)
- OPT: Open Pre-trained Transformer Language Models(2022)
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale(2022)
- Neural Architecture Search for Spiking Neural Networks(2022)
- Rate Coding or Direct Coding: Which One is Better for Accurate, Robust, and Energy-efficient Spiking Neural Networks?(2022)
- Exploring Lottery Ticket Hypothesis in Spiking Neural Networks(2022)
- NITI: Training Integer Neural Networks Using Integer-only Arithmetic(2022)
- PocketNN: Integer-only Training and Inference of Neural Networks via Direct Feedback Alignment and Pocket Activations in Pure C++(2022)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU(2023)
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models(2023)
- Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism(2023)
- Efficient Memory Management for Large Language Model Serving with PagedAttention(2023)
- Efficiently Scaling Transformer Inference(2023)
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(2023)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache(2023)
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models(2023)
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models(2023)
- Training Spiking Neural Networks Using Lessons From Deep Learning(2023)
- OneBit: Towards Extremely Low-bit Large Language Models(2024)
- Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond(2024)
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding(2024)
- Efficient Streaming Language Models with Attention Sinks(2024)
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization(2024)
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models(2024)
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference(2024)
- ThinK: Thinner Key Cache by Query-Driven Pruning(2024)
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM(2024)
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs(2024)
- SparQ Attention: Bandwidth-Efficient LLM Inference(2024)
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving(2024)
- AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration(2024)
- SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models(2024)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models(2024)
- Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving(2024)
- Addition is All You Need for Energy-efficient Language Models(2024)
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference(2024)
- KV Prediction for Improved Time to First Token(2024)
- NITRO-D: Native Integer-only Training of Deep Convolutional Neural Networks(2024)
RISC-V
- A 45nm 1.3GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators(2014)
- TAIGA: A new RISC-V soft-processor framework enabling high performance CPU architectural features(2017)
- Framework and Tools for Undergraduates Designing RISC-V Processors on an FPGA in Computer Architecture Education(2019)
- Open-Source RISC-V Processor IP Cores for FPGAs — Overview and Evaluation(2019)
- GVSoC: A Highly Configurable, Fast and Accurate Full-Platform Simulator for RISC-V based IoT Processors(2021)
- RVfpga: Using a RISC-V Core Targeted to an FPGA in Computer Architecture Education(2021)
- Design and verification of RISC-V CPU based on HLS and UVM(2021)
- A comparative survey of open-source application-class RISC-V processor implementations(2021)
- A review of CNN accelerators for embedded systems based on RISC-V(2022)
- A Survey of RISC-V CPU for IoT Applications(2022)